In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.
Display the contents of MIMIC hosp and icu data folders:
ls-l ~/mimic/hosp/
total 8859752
-rw-rw-r--@ 1 jacenai staff 15516088 Jan 5 2023 admissions.csv.gz
-rw-rw-r--@ 1 jacenai staff 427468 Jan 5 2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 jacenai staff 859438 Jan 5 2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 jacenai staff 578517 Jan 5 2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 jacenai staff 12900 Jan 5 2023 d_labitems.csv.gz
-rw-rw-r--@ 1 jacenai staff 25070720 Jan 5 2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 jacenai staff 7426955 Jan 5 2023 drgcodes.csv.gz
-rw-rw-r--@ 1 jacenai staff 508524623 Jan 5 2023 emar.csv.gz
-rw-rw-r--@ 1 jacenai staff 471096030 Jan 5 2023 emar_detail.csv.gz
-rw-rw-r--@ 1 jacenai staff 1767138 Jan 5 2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 1939088924 Jan 5 2023 labevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 96698496 Jan 5 2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 36124944 Jan 5 2023 omr.csv.gz
-rw-rw-r--@ 1 jacenai staff 2312631 Jan 5 2023 patients.csv.gz
-rw-rw-r--@ 1 jacenai staff 398753125 Jan 5 2023 pharmacy.csv.gz
-rw-rw-r--@ 1 jacenai staff 498505135 Jan 5 2023 poe.csv.gz
-rw-rw-r--@ 1 jacenai staff 25477219 Jan 5 2023 poe_detail.csv.gz
-rw-rw-r--@ 1 jacenai staff 458817415 Jan 5 2023 prescriptions.csv.gz
-rw-rw-r--@ 1 jacenai staff 6027067 Jan 5 2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 jacenai staff 122507 Jan 5 2023 provider.csv.gz
-rw-rw-r--@ 1 jacenai staff 6781247 Jan 5 2023 services.csv.gz
-rw-rw-r--@ 1 jacenai staff 36158338 Jan 5 2023 transfers.csv.gz
ls-l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 jacenai staff 35893 Jan 5 2023 caregiver.csv.gz
-rw-rw-r--@ 1 jacenai staff 2467761053 Jan 5 2023 chartevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 57476 Jan 5 2023 d_items.csv.gz
-rw-rw-r--@ 1 jacenai staff 45721062 Jan 5 2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 2614571 Jan 5 2023 icustays.csv.gz
-rw-rw-r--@ 1 jacenai staff 251962313 Jan 5 2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 324218488 Jan 5 2023 inputevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 38747895 Jan 5 2023 outputevents.csv.gz
-rw-rw-r--@ 1 jacenai staff 20717852 Jan 5 2023 procedureevents.csv.gz
Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)
Q1.1 Speed, memory, and data types
There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.
Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)
Answer
library(tidyverse)library(data.table)library(pryr)mimic_path <-"~/mimic/hosp/"# reading time for read.csvsystem.time(data_read.csv <-read.csv(str_c(mimic_path,"admissions.csv.gz")))
user system elapsed
3.339 0.064 3.405
# reading time for read_csvsystem.time(data_read_csv <-read_csv(str_c(mimic_path,"admissions.csv.gz")))
user system elapsed
0.939 0.103 0.626
# reading time for freadsystem.time(data_fread <-fread(str_c(mimic_path,"admissions.csv.gz")))
user system elapsed
0.426 0.044 0.521
# memory usage for read.csvpryr::object_size(data_read.csv)
158.71 MB
# memory usage for read_csvpryr::object_size(data_read_csv)
55.31 MB
# memory usage for freadpryr::object_size(data_fread)
In terms of the speed and based on the user time, fread is the fastest, followed by read_csv, and read.csv is the slowest.
In terms of the memory usage of the resultant dataframe or tibble, fread uses the least memory, followed by read_csv, and read.csv uses the most memory.
In terms of the default parsed data types, fread and read_csv are similar if “double” and “integer” are categorized as “numeric” data type; However, read.csv is different in some variables’ data type.
Q1.2 User-supplied data types
Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)
No. The run time almost does not change. The memory that the resultant tibble uses is 55.31 MB.
Q2. Ingest big data files
Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.
ls-l ~/mimic/hosp/labevents.csv.gz
Display the first 10 lines of this file.
zcat< ~/mimic/hosp/labevents.csv.gz |head-10
Q2.1 Ingest labevents.csv.gz by read_csv
Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.
Q2.2 Ingest selected columns of labevents.csv.gz by read_csv
Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)
Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.
In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)
Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?
Q2.4 Ingest labevents.csv by Apache Arrow
Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.
Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.
Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter
Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.
Q2.6 DuckDB
Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)
Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.
Q3. Ingest and filter chartevents.csv.gz
chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are
zcat< ~/mimic/icu/chartevents.csv.gz |head-10
d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.
zcat< ~/mimic/icu/d_items.csv.gz |head-10
In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.
Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.